Image Captioning


Image captioning is the task of generating a textual description of an image. It combines Computer Vision (CV), which interprets the image content, with Natural Language Processing (NLP), which produces the caption text.
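As a rough illustration of how the two stages can fit together, the toy sketch below pairs a "vision" step that turns pixels into features with a "language" step that greedily picks the next word. Everything here is hypothetical and hand-written for illustration: `encode_image`, `next_word_scores`, `caption`, the single brightness feature, and the five-word vocabulary are assumptions, not part of any real captioning model or of the papers listed below. Real systems replace both stages with learned neural networks.

```python
from typing import Dict, List

START, END = "<start>", "<end>"


def encode_image(pixels: List[List[float]]) -> Dict[str, float]:
    """Toy vision stage: summarize a grayscale image (values in [0, 1])."""
    flat = [p for row in pixels for p in row]
    return {"brightness": sum(flat) / len(flat)}


def next_word_scores(prev: str, feats: Dict[str, float]) -> Dict[str, float]:
    """Toy language stage: score candidate next words.

    Scores depend on the previous word and on the image features,
    standing in for a learned, image-conditioned language model.
    """
    b = feats["brightness"]
    table = {
        START: {"a": 1.0},
        "a": {"bright": b, "dark": 1.0 - b},
        "bright": {"image": 1.0},
        "dark": {"image": 1.0},
        "image": {END: 1.0},
    }
    return table[prev]


def caption(pixels: List[List[float]], max_len: int = 10) -> str:
    """Greedy decoding: repeatedly take the highest-scoring next word."""
    feats = encode_image(pixels)
    words: List[str] = []
    prev = START
    for _ in range(max_len):
        scores = next_word_scores(prev, feats)
        prev = max(scores, key=scores.get)
        if prev == END:
            break
        words.append(prev)
    return " ".join(words)


print(caption([[0.9, 0.8], [0.7, 1.0]]))  # a bright image
print(caption([[0.1, 0.2], [0.0, 0.1]]))  # a dark image
```

The split into an encoder and a word-by-word decoder mirrors one common arrangement for captioning; greedy decoding is used here only because it is the simplest strategy to show.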

MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation

Apr 29, 2025
Via arXiv

SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

Apr 29, 2025
Via arXiv

Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers

Apr 29, 2025
Via arXiv

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Apr 29, 2025
Via arXiv

Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition

Apr 28, 2025
Via arXiv

TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation

Apr 25, 2025
Via arXiv

Describe Anything: Detailed Localized Image and Video Captioning

Apr 22, 2025
Via arXiv

Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning

Apr 21, 2025
Via arXiv

Vision language models are unreliable at trivial spatial cognition

Apr 22, 2025
Via arXiv

InstaRevive: One-Step Image Enhancement via Dynamic Score Matching

Apr 22, 2025
Via arXiv